This document presents a discussion descriptive statistical analysis for the Telco customer dataset shown in Section 3.3.
#### import telco customer data and libraries
source("up(telco).R")
## Loading required package: parsnip
## Loading required package: survival
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## Loading required package: ggsurvfit
## Loading required package: censored
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ lubridate 1.9.2 ✔ tibble 3.2.1
## ✔ readr 2.1.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
##
## Attaching package: 'plotly'
##
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
##
## The following object is masked from 'package:stats':
##
## filter
##
##
## The following object is masked from 'package:graphics':
##
## layout
##
##
##
## Attaching package: 'scales'
##
##
## The following object is masked from 'package:readr':
##
## col_factor
##
##
## The following object is masked from 'package:purrr':
##
## discard
##
##
## This is bayesplot version 1.10.0
##
## - Online documentation and vignettes at mc-stan.org/bayesplot
##
## - bayesplot theme set to bayesplot::theme_default()
##
## * Does _not_ affect other ggplot2 plots
##
## * See ?bayesplot_theme_set for details on theme setting
##
## Loading required package: gridExtra
##
##
## Attaching package: 'gridExtra'
##
##
## The following object is masked from 'package:dplyr':
##
## combine
The Telco customer churn data is available in IBM Business Analytics Community Connect XXXX and contains information about a Telco company that provides home phone and internet services to 7,043 customers in California in the United States. The dataset contains 18 variables about the profile of the customers and a variable indicating who have left or stayed the service and presents an approximate 73.4% censorship rate. The response variable tenure represents the number of months the customer has stayed with the company with average of 32 months. Lifetimes equal zero were removed resulting in a dataset of 7,032 customers. Demographics characteristics are included for each customer, as well as customer account information and service information. Description of the variables as follows:
Demographic Information
Gender: Whether the client is a female or a male (Female, Male).
SeniorCitizen: Whether the client is a senior citizen or not ( 0, 1).
Partner: Whether the client has a partner or not (Yes, No).
Dependents: Whether the client has dependents or not (Yes, No).
Services Information
PhoneService: Whether the client has a phone service or not (Yes, No).
MultipleLines: Whether the client has multiple lines or not (No phone service, No, Yes).
InternetServices: Whether the client is subscribed to Internet service with the company (DSL, Fiber optic, No)
OnlineSecurity: Whether the client has online security or not (No internet service, No, Yes).
OnlineBackup: Whether the client has online backup or not (No internet service, No, Yes).
DeviceProtection: Whether the client has device protection or not (No internet service, No, Yes).
TechSupport: Whether the client has tech support or not (No internet service, No, Yes).
StreamingTV: Whether the client has streaming TV or not (No internet service, No, Yes).
StreamingMovies: Whether the client has streaming movies or not (No internet service, No, Yes).
Customer Account Information
tenure: Number of months the customer has stayed with the company (Multiple different numeric values).
Contract: Indicates the customer’s current contract type (Month-to-Month, One year, Two year).
PaperlessBilling: Whether the client has paperless billing or not (Yes, No).
PaymentMethod: The customer’s payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit Card (automatic)).
MontlyCharges: The amount charged to the customer monthly (Multiple different numeric values). ´
TotalCharges: The total amount charged to the customer (Multiple different numeric values).
rmarkdown::paged_table(df)
We present a descriptive analysis based on empirical marginal survival curves with the Kaplan-Meier estimator and evaluate the hazard curve for some variables including the categories demographics, service information, and customer account information.
dados <- df
#Kaplan-Meier fit (marginal)
fit.km1 <- survfit2(Surv(tenure, status) ~ gender, dados)
fit.km2 <- survfit2(Surv(tenure, status) ~ PaymentMethod, dados)
fit.km3 <- survfit2(Surv(tenure, status) ~ Contract, dados)
fit.km4 <- survfit2(Surv(tenure, status) ~ SeniorCitizen, dados)
fit.km5 <- survfit2(Surv(tenure, status) ~ InternetService, dados)
fit.km6 <- survfit2(Surv(tenure, status) ~ StreamingTV, dados)
fit.km8 <- survfit2(Surv(tenure, status) ~ Dependents, dados)
fit.km9 <- survfit2(Surv(tenure, status) ~ PaperlessBilling, dados)
fit.km10<- survfit2(Surv(tenure, status) ~ DeviceProtection, dados)
fit.km11<- survfit2(Surv(tenure, status) ~ PhoneService, dados)
fit.km12<- survfit2(Surv(tenure, status) ~ TechSupport, dados)
extrai_km <- function(fit_km) {
ret <- tidy_survfit(fit_km, type = "surv") %>%
select(.eval_time = time, .pred_survival = estimate, id = strata) %>%
group_by(id) %>%
mutate(haz = tx.emp(.eval_time, .pred_survival)) %>%
tidyr::nest(.pred = c(.eval_time, .pred_survival))
return(ret)
}
kms <- list(Gender=fit.km1, PaymentMethod=fit.km2, Contract=fit.km3,
SeniorCitizen= fit.km4, InternetService= fit.km5,
StreamingTV= fit.km6,Dependents=fit.km8, PaperlessBilling= fit.km9, DeviceProtection=fit.km10, PhoneService= fit.km11, TechSupport=fit.km12)
resultado= map(kms, extrai_km)
preds <-bind_rows(Gender=resultado$Gender, .id = "modelo") %>%
group_by(modelo) %>%
ungroup() %>%
tidyr::unnest(cols = .pred)
a1<- ggplot(aes(x = .eval_time, y = .pred_survival, col=id), data = preds) +
xlab("survival time (in months)") + ylab("survival probabilities")+
geom_step(linewidth=1) +
theme_bw() +
scale_y_continuous(limits=c(0.2,1), breaks = seq(0.2,1,by=.2), labels = point) +
scale_x_continuous(limits=c(0,75), breaks = seq(0,75,by=20), labels = grad)
a2<-ggplot(aes(x = .eval_time, y = haz, col=id), data = preds) +
xlab("survival time (in months)") + ylab("hazard rate")+
theme_bw() +
geom_line(linewidth=1) +
scale_y_continuous(limits=c(0,0.06), breaks = seq(0,0.06,by=.02), labels = point)+
scale_x_continuous(limits=c(0,75), breaks = seq(0,75,by=20), labels = grad)
PaymentMethod
For the payment method see that the behaviour of survival curves and churn rates for Credit Card (automatic), Bank Transfer (automatic) and Mailed Check are quite similar. On the other hand, when we consider the instantaneous lapse curve, notice that customers that preferable for an {Eletronic check} as a payment method are the majority to leave the company. In this case, we consider jointly these classes in a new category called ``others’’.
preds <-
bind_rows(PaymentMethod=resultado$PaymentMethod, .id = "modelo") %>%
group_by(modelo) %>%
ungroup() %>%
tidyr::unnest(cols = .pred)
b1<- ggplot(aes(x = .eval_time, y = .pred_survival, col=id), data = preds) +
xlab("survival time (in months)") + ylab("survival probabilities")+
geom_step(linewidth=1) +
theme_bw() +
scale_y_continuous(limits=c(0.2,1), breaks = seq(0.2,1,by=.2), labels = point) +
scale_x_continuous(limits=c(0,75), breaks = seq(0,75,by=20), labels = grad)
b2<-ggplot(aes(x = .eval_time, y = haz, col=id), data = preds) +
xlab("survival time (in months)") + ylab("hazard rate")+
theme_bw() +
geom_line(linewidth=1) +
scale_y_continuous(limits=c(0,0.06), breaks = seq(0,0.06,by=.02), labels = point)+
scale_x_continuous(limits=c(0,75), breaks = seq(0,75,by=20), labels = grad)
Contract
For the customer account information, see that customers with Month-to-Month contracts have higher lapse rates compared to clients with One-year and Two-year contracts. See that the hazard curve are very similar for the contract that has payment more than one year. In this way, we consider to aggregate in a unique group called {``One-year +’’}.
preds <-
bind_rows(Contract=resultado$Contract, .id = "modelo") %>%
group_by(modelo) %>%
ungroup() %>%
tidyr::unnest(cols = .pred)
c1<- ggplot(aes(x = .eval_time, y = .pred_survival, col=id), data = preds) +
xlab("survival time (in months)") + ylab("survival probabilities")+
geom_step(linewidth=1) +
theme_bw() +
scale_y_continuous(limits=c(0.2,1), breaks = seq(0.2,1,by=.2), labels = point) +
scale_x_continuous(limits=c(0,75), breaks = seq(0,75,by=20), labels = grad)
c2<-ggplot(aes(x = .eval_time, y = haz, col=id), data = preds) +
xlab("survival time (in months)") + ylab("hazard rate")+
theme_bw() +
geom_line(linewidth=1) +
scale_y_continuous(limits=c(0,0.06), breaks = seq(0,0.06,by=.02), labels = point)+
scale_x_continuous(limits=c(0,75), breaks = seq(0,75,by=20), labels = grad)
preds <-
bind_rows(SeniorCitizen=resultado$SeniorCitizen, .id = "modelo") %>%
group_by(modelo) %>%
ungroup() %>%
tidyr::unnest(cols = .pred)
d1<- ggplot(aes(x = .eval_time, y = .pred_survival, col=id), data = preds) +
xlab("survival time (in months)") + ylab("survival probabilities")+
geom_step(linewidth=1) +
theme_bw() +
scale_y_continuous(limits=c(0.2,1), breaks = seq(0.2,1,by=.2), labels = point) +
scale_x_continuous(limits=c(0,75), breaks = seq(0,75,by=20), labels = grad)
d2<-ggplot(aes(x = .eval_time, y = haz, col=id), data = preds) +
xlab("survival time (in months)") + ylab("hazard rate")+
theme_bw() +
geom_line(linewidth=1) +
scale_y_continuous(limits=c(0,0.06), breaks = seq(0,0.06,by=.02), labels = point)+
scale_x_continuous(limits=c(0,75), breaks = seq(0,75,by=20), labels = grad)
preds <-
bind_rows(InternetService=resultado$InternetService, .id = "modelo") %>%
group_by(modelo) %>%
ungroup() %>%
tidyr::unnest(cols = .pred)
e1<- ggplot(aes(x = .eval_time, y = .pred_survival, col=id), data = preds) +
xlab("survival time (in months)") + ylab("survival probabilities")+
geom_step(linewidth=1) +
theme_bw() +
scale_y_continuous(limits=c(0.2,1), breaks = seq(0.2,1,by=.2), labels = point) +
scale_x_continuous(limits=c(0,75), breaks = seq(0,75,by=20), labels = grad)
e2<-ggplot(aes(x = .eval_time, y = haz, col=id), data = preds) +
xlab("survival time (in months)") + ylab("hazard rate")+
theme_bw() +
geom_line(linewidth=1) +
scale_y_continuous(limits=c(0,0.06), breaks = seq(0,0.06,by=.02), labels = point)+
scale_x_continuous(limits=c(0,75), breaks = seq(0,75,by=20), labels = grad)
preds <-
bind_rows(StreamingTV=resultado$StreamingTV, .id = "modelo") %>%
group_by(modelo) %>%
ungroup() %>%
tidyr::unnest(cols = .pred)
f1<- ggplot(aes(x = .eval_time, y = .pred_survival, col=id), data = preds) +
xlab("survival time (in months)") + ylab("survival probabilities")+
geom_step(linewidth=1) +
theme_bw() +
scale_y_continuous(limits=c(0.2,1), breaks = seq(0.2,1,by=.2), labels = point) +
scale_x_continuous(limits=c(0,75), breaks = seq(0,75,by=20), labels = grad)
f2<-ggplot(aes(x = .eval_time, y = haz, col=id), data = preds) +
xlab("survival time (in months)") + ylab("hazard rate")+
theme_bw() +
geom_line(linewidth=1) +
scale_y_continuous(limits=c(0,0.06), breaks = seq(0,0.06,by=.02), labels = point)+
scale_x_continuous(limits=c(0,75), breaks = seq(0,75,by=20), labels = grad)
preds <-
bind_rows(PaperlessBilling=resultado$PaperlessBilling, .id = "modelo") %>%
group_by(modelo) %>%
ungroup() %>%
tidyr::unnest(cols = .pred)
h1<- ggplot(aes(x = .eval_time, y = .pred_survival, col=id), data = preds) +
xlab("survival time (in months)") + ylab("survival probabilities")+
geom_step(linewidth=1) +
theme_bw() +
scale_y_continuous(limits=c(0.2,1), breaks = seq(0.2,1,by=.2), labels = point) +
scale_x_continuous(limits=c(0,75), breaks = seq(0,75,by=20), labels = grad)
h2<-ggplot(aes(x = .eval_time, y = haz, col=id), data = preds) +
xlab("survival time (in months)") + ylab("hazard rate")+
theme_bw() +
geom_line(linewidth=1) +
scale_y_continuous(limits=c(0,0.06), breaks = seq(0,0.06,by=.02), labels = point)+
scale_x_continuous(limits=c(0,75), breaks = seq(0,75,by=20), labels = grad)
preds <-
bind_rows(DeviceProtection=resultado$DeviceProtection, .id = "modelo") %>%
group_by(modelo) %>%
ungroup() %>%
tidyr::unnest(cols = .pred)
i1<- ggplot(aes(x = .eval_time, y = .pred_survival, col=id), data = preds) +
xlab("survival time (in months)") + ylab("survival probabilities")+
geom_step(linewidth=1) +
theme_bw() +
scale_y_continuous(limits=c(0.2,1), breaks = seq(0.2,1,by=.2), labels = point) +
scale_x_continuous(limits=c(0,75), breaks = seq(0,75,by=20), labels = grad)
i2<-ggplot(aes(x = .eval_time, y = haz, col=id), data = preds) +
xlab("survival time (in months)") + ylab("hazard rate")+
theme_bw() +
geom_line(linewidth=1) +
scale_y_continuous(limits=c(0,0.06), breaks = seq(0,0.06,by=.02), labels = point)+
scale_x_continuous(limits=c(0,75), breaks = seq(0,75,by=20), labels = grad)
preds <-
bind_rows(PhoneService=resultado$PhoneService, .id = "modelo") %>%
group_by(modelo) %>%
ungroup() %>%
tidyr::unnest(cols = .pred)
j1<- ggplot(aes(x = .eval_time, y = .pred_survival, col=id), data = preds) +
xlab("survival time (in months)") + ylab("survival probabilities")+
geom_step(linewidth=1) +
theme_bw() +
scale_y_continuous(limits=c(0.2,1), breaks = seq(0.2,1,by=.2), labels = point) +
scale_x_continuous(limits=c(0,75), breaks = seq(0,75,by=20), labels = grad)
j2<-ggplot(aes(x = .eval_time, y = haz, col=id), data = preds) +
xlab("survival time (in months)") + ylab("hazard rate")+
theme_bw() +
geom_line(linewidth=1) +
scale_y_continuous(limits=c(0,0.06), breaks = seq(0,0.06,by=.02), labels = point)+
scale_x_continuous(limits=c(0,75), breaks = seq(0,75,by=20), labels = grad)
The lapse rate for the two categories of Gender in the data are nearly the same. Similar behaviour is observed for some customer service information such as PhoneService, MutipleLines and OnlineSecurity. Also, there is multicollinearity for groups of variables, in particular OnlineSecurity, TechSupport, OnlineBackup and DeviceProtection are all dependent on the OnlineService variable. Also, the covariates MonthlyCharges and TotalCharges present high collinearity.
sp1= ggplot(aes(x=MonthlyCharges, y=TotalCharges), data=df) +
theme_bw() +
geom_point()
ggplotly(sp1)